Taste is difficult to map and understand because different people have different preferences. At the same time, it’s well known that wines differ in quality, and those differences translate into a wide span of prices. The goal of this document is to present a technical analysis of the physical and chemical variables of Portuguese wines and to shed some light on the relationships between the physicochemical properties of a wine and the ratings given to it by wine experts.
We have deleted the variable listing the IDs.
Below is a summary of all variables:
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
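A summary like the one above can be produced with base R. The sketch below is only illustrative: the tiny `wines` data frame and its ID column name `X` are made-up stand-ins for the real file, which would normally be loaded with `read.csv()`.

```r
# Toy stand-in for the wine data. The values and the ID column name `X`
# are assumptions for illustration, not rows from the real dataset.
wines <- data.frame(
  X       = 1:5,
  alcohol = c(9.4, 9.8, 10.2, 11.1, 12.5),
  pH      = c(3.51, 3.20, 3.26, 3.16, 3.39),
  quality = c(5L, 5L, 6L, 6L, 7L)
)

# Drop the ID column, since it carries no physicochemical information.
wines$X <- NULL

# Per-variable minimum, quartiles, mean and maximum.
print(summary(wines))
```

Dropping the ID column first keeps it from showing up as a meaningless numeric variable in the summary and in later correlations.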
In our case, there is only one categorical variable, quality, which takes integer values between 0 and 10.
Let’s start by analyzing the quality:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.636 6.000 8.000
As we can see, the median rating is 6.0 and the best wine is rated 8.0, which means that it is very difficult to find good wines (>7) in this dataset. The plot shows that the majority of wines are rated 5 or 6. This also means that our analysis may be biased towards lower quality wines.
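The shares described above can be checked with a frequency table. This is a minimal sketch on a toy vector of scores (assumed values, not the real 1599 ratings); with the real data the `quality` column would be used instead.

```r
# Toy quality scores, assumed for illustration only.
quality <- c(3, 5, 5, 5, 6, 6, 6, 6, 7, 8)

# Count how many wines received each score.
quality_counts <- table(quality)
print(quality_counts)

# Share of wines rated 5 or 6, and share rated above 7.
share_mid <- mean(quality %in% c(5, 6))
share_top <- mean(quality > 7)
```

Taking the mean of a logical vector gives the proportion of `TRUE` values, which is a compact way to express "what fraction of wines fall in this range".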
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The minimum alcohol percentage by volume is 8.4 and it ranges up to 14.9. The distribution is concentrated around 10%, which is reasonable for most wines I know. I wonder if alcohol levels are actually relevant for a good rating, as stronger wines may be more difficult to appreciate. Alcohol also evaporates easily and can affect smell.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Free SO2 is also spread across a wide range. The minimum is 1 and the maximum goes up to 72 mg/dm³. Most values fall in the range of 7-20 mg/dm³. This substance is related to the oxidation of the wine, and it’s well known that oxidation affects the taste of wine.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
Total SO2 has a wide range of values, with a minimum of 6 mg/dm³, a maximum of 289 mg/dm³ and a mean of around 46 mg/dm³ (higher than free SO2, as expected, since free SO2 is a part of the total).
After applying a log10 transformation to the plot, the effect of the long tail is reduced and we can clearly see a more normal distribution around the value of 30 mg/dm³.
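The effect of the log10 transformation on a long-tailed variable can be sketched numerically. The values below are a toy right-skewed vector (an assumption for illustration, not the real measurements), and `skewness()` is a small helper defined here rather than a function from any package.

```r
# Toy right-skewed values in mg/dm^3, assumed for illustration only.
total_so2 <- c(6, 12, 18, 22, 30, 38, 45, 62, 90, 150, 289)

# Simple moment-based skewness: 0 for a symmetric distribution,
# positive when the right tail is longer.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- skewness(total_so2)         # skewness on the original scale
skew_log <- skewness(log10(total_so2))  # skewness after the log10 transform

print(c(raw = skew_raw, log10 = skew_log))
```

Passing `log10(total_so2)` to `hist()` would show the corresponding histogram. The log scale compresses the large values in the tail much more than the small ones, which is why the skewness drops.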
Fixed and volatile acidity each show two peaks, located around 7-8 g/dm³ for fixed acidity and 0.4-0.6 g/dm³ for volatile acidity. Fixed acidity ranges from 4.6 up to almost 16 g/dm³, with a mean of around 8 g/dm³. Volatile acidity ranges from 0.12 to 1.58 g/dm³, with a mean of around 0.53 g/dm³. Higher levels of volatile acidity (usually linked with acetic acid) can yield a bad, vinegar-like taste. The amount of non-volatile acid is much higher because it makes up the bulk of the acidity in a wine.
Citric acid is found at levels comparable to or lower than volatile acidity (mean of 0.27 versus 0.53 g/dm³), and both are small compared to fixed acidity. It ranges from 0 to 1 g/dm³.
pH levels range from 2.74 to 4.01, which is within the acidic range, exactly as we expect from wines.
Chlorides are usually linked to a salty taste and should be found in small quantities. They range from 0.012 up to 0.611 g/dm³, with a median of around 0.08 g/dm³. The maximum is about 50 times the minimum in this dataset.
Higher quality wines seem to have a more concentrated range of chlorides <0.1 but we still need to dive deeper into the data to check if there’s any relationship between variables.
## [1] "fixed.acidity" "volatile.acidity" "citric.acid"
## [4] "residual.sugar" "chlorides" "free.sulfur.dioxide"
## [7] "total.sulfur.dioxide" "density" "pH"
## [10] "sulphates" "alcohol" "quality"
Variables:
According to the list above, we have 12 variables: the first 11 are measured physicochemical properties, and the final variable, quality, is a rating from 0 (very bad) to 10 (excellent) given by at least 3 wine experts. There are 1599 observations in this dataset. Most of the wines are of average quality (ratings of 5 or 6).
The main feature, familiar to us and to many people who appreciate drinks, is the alcohol level. We’d like to evaluate the correlation between alcohol level and quality and use other variables to understand what drives quality, as we believe there is a limit to how much the alcohol level can affect taste positively.
This is a geeky analysis, as many other variables can affect the taste of wine, such as temperature and how long the wine was in contact with oxygen. However, for this analysis we want to focus on physicochemical properties. We think that the properties commonly examined in a wine, such as alcohol percentage, acidity levels (including pH), total sulfur dioxide (related to oxidation) and residual sugar (related to sweetness), may be the most impactful ones.
Based on the description of the dataset, wines with free SO2 levels higher than 50 ppm (or mg/dm³) may have their taste affected. SO2 mainly helps prevent oxidation, but it can affect the taste at high concentrations. Thus, we’ve created a categorical variable called level.free.sulfur.dioxide that splits the dataset into High (>50 ppm) and Low (<=50 ppm) levels of free SO2.
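One way to build such a High/Low variable is with `ifelse()` and `factor()`. The snippet is a sketch on toy readings (assumed values), not necessarily the exact code used for this report; only the 50 ppm threshold and the variable name come from the text.

```r
# Toy free SO2 readings in mg/dm^3, assumed for illustration only.
wines <- data.frame(free.sulfur.dioxide = c(1, 14, 21, 50, 51, 72))

# Label each wine "High" above the 50 ppm threshold and "Low" otherwise.
wines$level.free.sulfur.dioxide <- factor(
  ifelse(wines$free.sulfur.dioxide > 50, "High", "Low"),
  levels = c("Low", "High")
)

# Number of wines in each group.
print(table(wines$level.free.sulfur.dioxide))
```

Note that a wine with exactly 50 ppm falls in the Low group, matching the `<=50 ppm` definition.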
Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Fixed and volatile acidity each showed two peaks, whereas total SO2 showed a long-tailed shape. We applied a log10 transform to total SO2 in order to reduce the skewness from the long tail and to visualize it in a more normal shape.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.00000000 -0.256130895 0.67170343
## volatile.acidity -0.25613089 1.000000000 -0.55249568
## citric.acid 0.67170343 -0.552495685 1.00000000
## residual.sugar 0.11477672 0.001917882 0.14357716
## chlorides 0.09370519 0.061297772 0.20382291
## free.sulfur.dioxide -0.15379419 -0.010503827 -0.06097813
## total.sulfur.dioxide -0.11318144 0.076470005 0.03553302
## density 0.66804729 0.022026232 0.36494718
## pH -0.68297819 0.234937294 -0.54190414
## sulphates 0.18300566 -0.260986685 0.31277004
## alcohol -0.06166827 -0.202288027 0.10990325
## quality 0.12405165 -0.390557780 0.22637251
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.114776724 0.093705186 -0.153794193
## volatile.acidity 0.001917882 0.061297772 -0.010503827
## citric.acid 0.143577162 0.203822914 -0.060978129
## residual.sugar 1.000000000 0.055609535 0.187048995
## chlorides 0.055609535 1.000000000 0.005562147
## free.sulfur.dioxide 0.187048995 0.005562147 1.000000000
## total.sulfur.dioxide 0.203027882 0.047400468 0.667666450
## density 0.355283371 0.200632327 -0.021945831
## pH -0.085652422 -0.265026131 0.070377499
## sulphates 0.005527121 0.371260481 0.051657572
## alcohol 0.042075437 -0.221140545 -0.069408354
## quality 0.013731637 -0.128906560 -0.050656057
## total.sulfur.dioxide density pH
## fixed.acidity -0.11318144 0.66804729 -0.68297819
## volatile.acidity 0.07647000 0.02202623 0.23493729
## citric.acid 0.03553302 0.36494718 -0.54190414
## residual.sugar 0.20302788 0.35528337 -0.08565242
## chlorides 0.04740047 0.20063233 -0.26502613
## free.sulfur.dioxide 0.66766645 -0.02194583 0.07037750
## total.sulfur.dioxide 1.00000000 0.07126948 -0.06649456
## density 0.07126948 1.00000000 -0.34169933
## pH -0.06649456 -0.34169933 1.00000000
## sulphates 0.04294684 0.14850641 -0.19664760
## alcohol -0.20565394 -0.49617977 0.20563251
## quality -0.18510029 -0.17491923 -0.05773139
## sulphates alcohol quality
## fixed.acidity 0.183005664 -0.06166827 0.12405165
## volatile.acidity -0.260986685 -0.20228803 -0.39055778
## citric.acid 0.312770044 0.10990325 0.22637251
## residual.sugar 0.005527121 0.04207544 0.01373164
## chlorides 0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide 0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide 0.042946836 -0.20565394 -0.18510029
## density 0.148506412 -0.49617977 -0.17491923
## pH -0.196647602 0.20563251 -0.05773139
## sulphates 1.000000000 0.09359475 0.25139708
## alcohol 0.093594750 1.00000000 0.47616632
## quality 0.251397079 0.47616632 1.00000000
pairs.panels(data, scale=TRUE)
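A correlation matrix of the kind shown above can be computed with base R's `cor()`, while `pairs.panels()` comes from the psych package. The sketch below uses a toy `data` frame with made-up values (only the column names follow the dataset) and sticks to base R so it runs without extra packages.

```r
# Toy data with made-up values. Column names follow the dataset.
data <- data.frame(
  fixed.acidity = c(7.4, 7.8, 11.2, 7.9, 6.7, 10.0),
  pH            = c(3.51, 3.26, 3.16, 3.30, 3.45, 3.20),
  alcohol       = c(9.4, 9.8, 9.8, 10.0, 11.5, 10.5)
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(data, method = "pearson")

# Rounded for readability, as in a report table.
print(round(cor_matrix, 2))
```

The result is symmetric with ones on the diagonal, so only one triangle of the matrix needs to be read.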
After checking the correlation between all variables in the dataset, we could see relatively high correlation between certain variables:
fixed.acidity vs. density
fixed.acidity vs. citric.acid
fixed.acidity vs. pH
free.sulfur.dioxide vs. total.sulfur.dioxide
alcohol vs. density
quality vs. alcohol
We can see the high positive correlation between fixed acidity and density, although the points are a bit scattered in the central part.
We can see a positive correlation between citric acid and fixed acidity. It does not look as strong as some of the others, but it is still there.
There is a clear correlation between fixed acidity and pH. Higher fixed acidity yields lower pH, which follows from the definition of pH.
Total and free sulfur dioxide also show a positive correlation, as free sulfur dioxide is a part of the total sulfur dioxide in the wine.
As we can see from above, the High vs. Low free SO2 threshold falls close to the 99th percentile of free sulfur dioxide. In other words, the cases with High free SO2 are outliers and don’t represent the dataset. Therefore, nearly all wines are under the 50 ppm that the description of the dataset defines as an acceptable amount of free SO2. From now on we won’t use the variable we created, as it doesn’t bring much value.
Higher alcohol levels are correlated to higher quality.
Volatile acidity is correlated (negatively) with quality. From the plots above we can clearly see that higher quality wines have lower volatile acidity in a more controlled range.
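These two relationships can be summarized by computing per-quality medians. This is a sketch on a toy `wines` data frame whose values are invented to mimic the described trend, so it illustrates the technique rather than reproducing the report's numbers.

```r
# Toy data, assumed for illustration only.
wines <- data.frame(
  quality          = c(5, 5, 6, 6, 7, 7, 8),
  alcohol          = c(9.4, 9.6, 10.0, 10.4, 11.2, 11.8, 12.6),
  volatile.acidity = c(0.70, 0.62, 0.55, 0.50, 0.40, 0.38, 0.32)
)

# Median alcohol and volatile acidity within each quality score.
medians_by_quality <- aggregate(
  cbind(alcohol, volatile.acidity) ~ quality,
  data = wines,
  FUN  = median
)
print(medians_by_quality)
```

Medians are used here instead of means because they are less sensitive to the few extreme values seen in the univariate plots.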
Fixed acidity and density seem to have a strong positive correlation.
Fixed acidity and citric acid levels also seem to have a strong positive correlation. This is expected, as fixed acidity reflects the acids present in the wine.
Fixed acidity and pH have a strong correlation as well, but a negative one, which is expected, as higher acidity means lower pH by definition.
Total SO2 and free SO2 also seem to be positively correlated, which is expected, as more SO2 will eventually lead to more free SO2 in chemical mixtures. The degree will depend on the oxidation levels of the mixture, but for wines the two appear closely correlated.
Interestingly enough, density and alcohol are also negatively correlated: higher levels of alcohol yield lower density wines. The points are very scattered, so the correlation is not as strong as the previous ones, but it is still substantial (about -0.5).
The strongest relationship found is between fixed acidity and pH, which is expected, as pH is a function of the acidity of a substance. We also have a strong correlation between total and free sulfur dioxide, which also makes sense, as both depend on the amount of sulfur dioxide in the wine.
Higher quality wines tend to stay on the lower right part of the plot, which means higher fixed acidity and lower volatile acidity.
Higher levels of alcohol and fixed acidity correlate with higher quality wines: the wines with quality >=7 stay in the top half of the plot.
Good wines concentrate on the left side of the plot. We can see some correlation between higher quality wines and lower levels of total sulfur dioxide. For free sulfur dioxide it’s not as clear, but they tend to stay in a region of lower concentration.
If we look at volatile acidity vs. fixed acidity vs. quality, we can see a “sweet spot” in the bottom right corner where the top quality wines (quality = 8) lie, surrounded by wines of quality 5, 6 and 7. The relationship is not perfect, but we can see a higher density of 7’s and 8’s there compared to the opposite corner, with higher volatile acidity and lower fixed acidity.
Also, if we look at alcohol vs. fixed acidity vs. quality, we can see another area with a higher concentration of top quality wines (especially if we focus on 7’s [yellow dots] and 8’s [black dots]) in the top right corner, with higher alcohol percentage and higher fixed acidity. This aligns with our analysis from previous sections. It seems that alcohol levels and fixed acidity contribute to a higher perceived wine quality. If we do the inverse exercise and look at the bottom part of the chart, we can see almost no wines above 6 in terms of quality.
Finally, when looking at free sulfur dioxide vs. total sulfur dioxide vs. quality, we can see that the majority of wines concentrate in a range of <25 for free sulfur dioxide, but without much correlation with quality, which makes sense as the property is just related to conserving the wine. On the other hand, we can see higher quality wines staying at lower levels of total sulfur dioxide. It’s known that too high a level of sulfur dioxide may alter the taste of wine.
It seems there’s a cluster towards lower levels of free sulfur dioxide and total sulfur dioxide. It makes sense as sulfur dioxide is used mainly for conservation purposes. Therefore, it should have a standard concentration amongst wines with a delimited range. More than a certain concentration would only ruin the wine, lower would make it too perishable.
Of all the properties analyzed, density has the strongest correlation with alcohol (approximately -0.50, per the correlation matrix). Plot One demonstrates this correlation and includes a linear model, drawn with the geom_smooth function from the ggplot2 library, that shows the trend and its error.
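The linear trend described for Plot One can be sketched with base R's `lm()`, which fits the same kind of least squares line that `geom_smooth(method = "lm")` draws. The alcohol and density values below are toy numbers assumed for illustration, not data from the report.

```r
# Toy alcohol (% vol) and density (g/cm^3) values, assumed for illustration.
alcohol <- c(9.0, 9.5, 10.0, 10.5, 11.0, 12.0, 13.0)
density <- c(0.9985, 0.9978, 0.9972, 0.9969, 0.9960, 0.9951, 0.9940)

# Ordinary least squares fit of density on alcohol.
fit   <- lm(density ~ alcohol)
slope <- unname(coef(fit)["alcohol"])

# Pearson correlation between the two variables.
r <- cor(alcohol, density)

print(c(slope = slope, correlation = r))
```

A negative slope together with a negative correlation corresponds to the downward trend line in the plot: wines with more alcohol tend to be less dense.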
Plot Two shows us the correlation between higher quality wines, lower levels of volatile acidity and higher levels of fixed acidity. In other words, the higher quality wines concentrate at the bottom right corner of the plot.
In Plot Three we can see that higher quality wines (quality levels around 7 and 8) have a higher alcohol percentage by volume. It seems that this variable has a strong relevance in defining wine quality.
We have performed an Exploratory Data Analysis of the dataset on the red variant of the Portuguese “Vinho Verde” wine. We first tried to understand the dataset and its variables and to look for sanity checks, for instance the low pH levels expected from wines due to their acidity. Similarly, we looked into trends and how the quality of the wines was distributed. Afterwards, we started to look for correlations and trends between different variables. Finally, we looked at those same trends and their correlation with quality, the main variable of interest in this dataset. Below are a few highlights from this process.
It seems that only one variable has a stronger correlation with quality, namely alcohol level. We were able to see that higher quality wines (quality levels around 7 and 8) have higher alcohol percentages. Maybe stronger wines use more grapes, or higher quality grapes that have more sugar, yielding better wines. Or maybe the taste of wine, as an alcoholic drink, is also linked to the taste of alcohol. The absence of alcohol may also affect the smell, as it is a very volatile substance.
Volatile acidity and fixed acidity also have some correlation with quality. We could interpret this as more “stable” wines, with higher levels of fixed acidity and lower levels of volatile acidity, tending to have better quality. Volatile acidity is linked to acetic acid, which can bring a bad, vinegar-like taste. Hence, higher levels of volatile acidity would definitely impact the taste and quality negatively. On the other hand, fixed acidity is linked to pH, which measures how basic or acidic a substance is and can affect the taste, as our mouth reacts differently to different levels of pH. We don’t have enough data to show that the variation in the dataset would actually be perceived by a human palate.
Overall we had a lot of variables to look at, and we ended up focusing on a few that sounded as if they would impact taste. Of course, we could expand this analysis endlessly, but given the time and knowledge constraints we chose a few variables and didn’t perform any relevant transformations. Segregating the dataset between higher and lower levels of sulfur dioxide showed no results whatsoever.
As a next step, we would like to look into prices, costs and brands and how they would affect the quality evaluations. We would also like to better understand how the evaluation process is affected by weather and temperature, as they definitely affect the sensory experience.
Citation: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.